Data Mining

Tao ZOU

2024-01-13

This note covers basic methods and algorithms for data science and related fields, so the material in this markdown is introductory.

Data types and proximity measures

Nominal

|              | \(attribute_1\) | \(attribute_2\) |
|--------------|-----------------|-----------------|
| \(sample_1\) | \(Yellow\)      | \(Purple\)      |
| \(sample_2\) | \(Red\)         | \(Red\)         |
| \(sample_3\) | \(Red\)         | \(Yellow\)      |
| \(sample_4\) | \(Black\)       | \(White\)       |

Proximity for attributes

\(d(attribute_1,attribute_2)=\frac{sample\_num-matches\_num}{sample\_num}=\frac{4-1}{4}=\frac{3}{4}\)
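The mismatch-ratio formula can be checked with a few lines of Python; the attribute values below are taken from the table above.

```python
# Nominal dissimilarity between two attributes: fraction of samples that mismatch
attribute_1 = ["Yellow", "Red", "Red", "Black"]
attribute_2 = ["Purple", "Red", "Yellow", "White"]

matches = sum(a == b for a, b in zip(attribute_1, attribute_2))  # only sample_2 matches
d = (len(attribute_1) - matches) / len(attribute_1)
print(d)  # (4 - 1) / 4 = 0.75
```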

Binary

|              | \(Gender\) | \(Birthday\) | \(Disease\) | \(Old\) | \(Sport\)   |
|--------------|------------|--------------|-------------|---------|-------------|
| \(sample_1\) | \(M\)      | \(singular\) | \(Y\)       | \(Y\)   | \(like\)    |
| \(sample_2\) | \(F\)      | \(plural\)   | \(N\)       | \(Y\)   | \(like\)    |
| \(sample_3\) | \(F\)      | \(singular\) | \(N\)       | \(N\)   | \(dislike\) |
| \(sample_4\) | \(F\)      | \(plural\)   | \(Y\)       | \(Y\)   | \(dislike\) |

Proximity for attributes

With respect to asymmetric binary data:

\[\begin{matrix}Disease/Old&Y&N\\Y&\color{red}2&\color{red}0\\N&\color{red}1&1\end{matrix}\]

\[d(Disease,Old)=\frac{\color{red}0+\color{red}1}{\color{red}2+\color{red}0+\color{red}1}=\frac{1}{3}\]

With respect to symmetric binary data:

\[\begin{matrix}Gender/Birthday&singular&plural\\M&1&0\\F&1&2\end{matrix}\]

\[d(Gender,Birthday)=\frac{0+1}{1+0+1+2}=\frac{1}{4}\]

Proximity for samples

With respect to asymmetric binary data:

\[\begin{matrix}sample_1/sample_2&1&0\\1&\color{red}2&\color{red}1\\0&\color{red}0&0\end{matrix}\]

\[d(sample_1,sample_2)=\frac{\color{red}1}{\color{red}2+\color{red}1+\color{red}0}=\frac{1}{3}\]

With respect to symmetric binary data (let M \(\to\) 1 and singular \(\to\) 1):

\[\begin{matrix}sample_1/sample_2&1&0\\1&\color{red}0&\color{red}2\\0&\color{red}0&\color{red}0\end{matrix}\]

\[d(sample_1, sample_2)=\frac{\color{red}2+\color{red}0}{\color{red}0+\color{red}2+\color{red}0+\color{red}0}=1\]
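The contingency counts for the sample-level comparison can be reproduced in code. A sketch, assuming the asymmetric binary attributes are Disease, Old, and Sport, with Y/like mapped to 1 and N/dislike mapped to 0:

```python
# Asymmetric binary dissimilarity (Jaccard distance) between sample_1 and sample_2.
s1 = [1, 1, 1]  # sample_1: Disease=Y, Old=Y, Sport=like
s2 = [0, 1, 1]  # sample_2: Disease=N, Old=Y, Sport=like

a = sum(x == 1 and y == 1 for x, y in zip(s1, s2))  # 1-1 matches
b = sum(x == 1 and y == 0 for x, y in zip(s1, s2))  # 1-0 mismatches
c = sum(x == 0 and y == 1 for x, y in zip(s1, s2))  # 0-1 mismatches

# 0-0 matches are ignored for asymmetric binary data
d = (b + c) / (a + b + c)
print(d)  # 1/3
```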

Ordinal

|              | \(Class\_Rank\) | \(Rich\_Rank\) |
|--------------|-----------------|----------------|
| \(sample_1\) | \(1\)           | \(poor\)       |
| \(sample_2\) | \(5\)           | \(medium\)     |
| \(sample_3\) | \(4\)           | \(rich\)       |
| \(sample_4\) | \(3\)           | \(rich\)       |

Convert the ordinal values to ranks, then map each rank \(r\in\{1,\dots,M\}\) to the interval-scaled value \(\frac{r-1}{M-1}\):

\[\begin{matrix}&Class\_ Rank&Rich\_Rank\\sample_1&1&1\\sample_2&4&2\\sample_3&3&3\\sample_4&2&3\end{matrix}\rightarrow\begin{matrix}Class\_ Rank&Rich\_Rank\\\frac{1-1}{4-1}=0&\frac{1-1}{3-1}=0\\\frac{4-1}{4-1}=1&\frac{2-1}{3-1}=\frac{1}{2}\\\frac{3-1}{4-1}=\frac{2}{3}&\frac{3-1}{3-1}=1\\\frac{2-1}{4-1}=\frac{1}{3}&\frac{3-1}{3-1}=1\end{matrix}\]
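The rank-normalization step can be sketched in Python, assuming ranks run from 1 to \(M\) and every rank value is present:

```python
# Map ordinal ranks r in {1..M} to [0, 1] via (r - 1) / (M - 1)
class_rank = [1, 4, 3, 2]  # ranks of the original values 1, 5, 4, 3
rich_rank = [1, 2, 3, 3]   # poor=1, medium=2, rich=3

def normalize(ranks):
    M = max(ranks)  # assumes the highest rank M appears in the data
    return [(r - 1) / (M - 1) for r in ranks]

print(normalize(class_rank))  # [0, 1, 2/3, 1/3]
print(normalize(rich_rank))   # [0.0, 0.5, 1.0, 1.0]
```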

Interval

Ratio

Mixed


Distance

  1. Minkowski Distance \[d(\vec{x}_i,\vec{x}_j;q)=\Big[\sum_{k=1}^d{|x_{ik}-x_{jk}|^q}\Big]^{\frac{1}{q}}\]

Manhattan Distance: \(d(\vec{x}_i,\vec{x}_j)=\sum_{k=1}^d{|x_{ik}-x_{jk}|}\)

Euclidean Distance: \(d(\vec{x}_i,\vec{x}_j)=\big[\sum_{k=1}^d{(x_{ik}-x_{jk})^2}\big]^\frac{1}{2}\)

Maximum Distance: \(d(\vec{x}_i,\vec{x}_j)=\max\limits_{1\leq k\leq d}|x_{ik}-x_{jk}|\)

  2. Mahalanobis Distance \[d(\vec{x}_i,\vec{x}_j)=(\vec{x}_i-\vec{x}_j)^T\Sigma^{-1}(\vec{x}_i-\vec{x}_j)\], where \(\Sigma\) is the covariance matrix of the data.
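All of these distances are available in scipy.spatial.distance. A quick sketch with toy vectors; note that SciPy's mahalanobis returns the square root of the quadratic form:

```python
import numpy as np
from scipy.spatial.distance import cityblock, euclidean, chebyshev, minkowski, mahalanobis

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

d_man = cityblock(x, y)        # Manhattan: |1-4| + |2-0| + |3-3| = 5
d_euc = euclidean(x, y)        # Euclidean: sqrt(9 + 4)
d_max = chebyshev(x, y)        # Maximum: max(3, 2, 0) = 3
d_mink = minkowski(x, y, p=3)  # general Minkowski with q = 3

# Mahalanobis needs the inverse covariance matrix of a data sample
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3))
VI = np.linalg.inv(np.cov(data.T))
d_mah = mahalanobis(x, y, VI)

print(d_man, d_euc, d_max, d_mink, d_mah)
```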

Clustering

Clustering partitions samples into a given number of groups based on similarity, without using labeled training data.

K-means clustering

  1. Randomly choose \(k\) initial points (centroids).

  2. Calculate the distance between every sample and each of the \(k\) points.

  3. Assign every sample to the closest of the \(k\) points, forming \(k\) clusters.

  4. Update the \(k\) points by computing the central position of each of the \(k\) clusters.

  5. Repeat steps 2 to 4 until the \(k\) points no longer change.

import numpy as np
from sklearn.cluster import KMeans

# Generate ndarray of shape (100, 3)
X = np.random.uniform(0, 10, size=(100, 3))
X = np.round(X, decimals=2)

# initialize KMeans with k = 3
kmeans = KMeans(n_clusters=3, n_init=10)
kmeans.fit(X)

labels = kmeans.labels_  # cluster label for every sample
centers = kmeans.cluster_centers_  # center coordinates of every cluster

for i in range(len(X)):
    print("sample {} belongs to {}".format(X[i], labels[i]))

for i in range(len(centers)):
    print("center coordinate of cluster {} is {}".format(i, centers[i]))

Hierarchical clustering

Hierarchical clustering repeatedly merges the closest pair of clusters. The distance between two clusters can be defined in five common ways: single linkage (minimum), complete linkage (maximum), average linkage, centroid distance, and Ward's method.

import numpy as np
from scipy.cluster import hierarchy
import matplotlib.pyplot as plt


# Build a pairwise distance matrix. It should be strictly upper or lower triangular, with zeros on the diagonal.
half_distance_matrix = np.array([[0, 0, 0, 0, 0],
                                [2.33, 0, 0, 0, 0],
                                [3.15, 1.30, 0, 0, 0],
                                [1.90, 1.50, 3.70, 0, 0],
                                [3.01, 0.47, 1.40, 1.82, 0]]).T

# np.tril_indices(half_distance_matrix.shape[0], -1) returns values' indices of lower triangular matrix without the diagonal values
# np.triu_indices(half_distance_matrix.shape[0], 1) returns values' indices of upper triangular matrix without the diagonal values
# hierarchy.linkage() accepts a condensed 1-D distance array.
compressed_distance_matrix = half_distance_matrix[np.triu_indices(half_distance_matrix.shape[0], 1)]
linkage_matrix = hierarchy.linkage(compressed_distance_matrix, method='single')

plt.close('all')
def link_color_func(*args):
    return 'cornflowerblue'
dendrogram = hierarchy.dendrogram(linkage_matrix, labels=['1', '2', '3', '4', '5'], link_color_func=link_color_func)
plt.xlabel('Samples Indices')
plt.ylabel('Distance')
plt.title('Hierarchical Clustering Tree')
plt.show()

The argument method can be “single”, “complete”, “average”, “centroid”, or “ward”, corresponding to the five inter-cluster distance definitions above.

Computing the full pairwise distance matrix is expensive for large datasets, since it requires \(O(n^2)\) time and space.

Density-based method

Eps: Maximum radius of the neighborhood.

MinPts: Minimum number of points in an Eps-neighborhood of that point.

Core point: Point with enough density.

Border point: Point without enough density but lying in the Eps-neighborhood of a core point.

Directly density-reachable: Point \(p\) is directly density-reachable from point \(q\) if and only if \(q\) is a core point and \(p\) is a neighbor of point \(q\).

Density-reachable: Point \(p\) is density-reachable from point \(q\) if there is a chain of points from \(q\) to \(p\) in which each point is directly density-reachable from the previous one. For example, if \(p\) is directly density-reachable from \(q\) and \(o\) is directly density-reachable from \(p\), then \(o\) is density-reachable from \(q\).

Algorithm (DBSCAN):

  1. Arbitrarily select a point \(p\).

  2. Retrieve all points density-reachable from \(p\).

  3. If \(p\) is a core point, a cluster is formed. Otherwise, visit the next point of the database.

  4. Continue the process until all points have been processed.
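DBSCAN is implemented in scikit-learn. A minimal sketch on toy data; the eps and min_samples values are assumptions chosen for this particular data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
# two dense blobs plus a few scattered outliers
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(30, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(30, 2)),
    rng.uniform(-2, 7, size=(5, 2)),
])

db = DBSCAN(eps=0.8, min_samples=5).fit(X)
labels = db.labels_  # label -1 marks noise points

print(np.unique(labels))
print("core points:", len(db.core_sample_indices_))
```

Unlike K-means, the number of clusters is not specified in advance; it emerges from the density parameters.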

Fuzzy clustering

  1. Randomly choose \(k\) samples to initialize \(k\) clusters.

  2. Calculate the probability that each sample belongs to each cluster using the formula below, forming the partition matrix.

\[p_{ij}=\frac{\frac{1}{dist^2(i,j)}}{\sum_{j'\in J}{\frac{1}{dist^2(i,j')}}}\], where \(i\) indexes samples and \(j\) indexes clusters.

partition matrix:

|               | \(sample_1\) | \(sample_2\) | \(\cdots\) | \(sample_n\) |
|---------------|--------------|--------------|------------|--------------|
| \(cluster_1\) | \(p_{11}\)   | \(p_{21}\)   | \(\cdots\) | \(p_{n1}\)   |
| \(cluster_2\) | \(p_{12}\)   | \(p_{22}\)   | \(\cdots\) | \(p_{n2}\)   |
| \(\vdots\)    | \(\vdots\)   | \(\vdots\)   | \(\ddots\) | \(\vdots\)   |
| \(cluster_m\) | \(p_{1m}\)   | \(p_{2m}\)   | \(\cdots\) | \(p_{nm}\)   |

  3. Update each cluster centroid by the formula below.

\[x_j=\frac{\sum_{i\in I}{x_ip_{ij}^2}}{\sum_{i\in I}{p_{ij}^2}}\], where \(x_j\) is the coordinate of cluster \(j\) along one axis. This formula is applied to every axis of a cluster.

  4. Repeat steps 2 and 3 until the partition matrix converges.
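Steps 2 and 3 can be sketched directly in NumPy. The samples and centroids below are toy values, and here the rows of the partition matrix are samples rather than clusters:

```python
import numpy as np

# One fuzzy-clustering update step (fuzziness m = 2), matching the formulas above
samples = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
centroids = np.array([[0.5, 0.0], [5.5, 5.0]])

# squared distances dist^2(i, j), shape (n_samples, k)
d2 = ((samples[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

# partition matrix: p_ij proportional to 1 / dist^2(i, j), rows sum to 1
inv = 1.0 / d2
p = inv / inv.sum(axis=1, keepdims=True)

# centroid update: mean of samples weighted by p_ij^2
w = p ** 2
new_centroids = (w.T @ samples) / w.sum(axis=0)[:, None]
print(p)
print(new_centroids)
```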

Compared to K-means, fuzzy clustering has advantages:

  1. Less sensitive to outliers, because centroid updates are weighted by membership probabilities rather than by hard assignments.

  2. Less sensitive to initial clusters.

The main shortcoming is that fuzzy clustering is computationally expensive, since it computes each sample’s membership probability for every cluster.

import numpy as np
import skfuzzy as fuzz

data = np.random.uniform(0, 10, size=(100, 4))

n_clusters = 3
# cmeans returns a 7-item tuple; the first two are the centroids and the partition matrix
cntr, u, u0, d, jm, n_iter, fpc = fuzz.cluster.cmeans(
    data.T, c=n_clusters, m=2, error=0.005, maxiter=1000)

print(cntr)  # coordinates of the n_clusters cluster centers
print(u.argmax(axis=0))  # each sample's most likely cluster; u has shape (3, 100)

The parameter m in fuzz.cluster.cmeans() is the fuzziness exponent; it is set to 2 here to stay consistent with the formulas above.